fix(journal): serialize concurrent appends via lockfile#1002

Open
carlos-alm wants to merge 6 commits into main from fix/journal-atomic-appends

Conversation

@carlos-alm
Contributor

Summary

Fixes #996.

appendJournalEntries() relied on fs.appendFileSync alone — that's atomic within a single call but not across concurrent writers. When a watch session and a manual codegraph build in another shell both write to .codegraph/changes.journal, their lines can interleave, producing truncated DELETED prefixes, partial paths, or a \n-less tail. A corrupted header then makes readJournal return valid: false and fall back silently to a hash scan — a silent perf regression, not a crash.

Fix

Wrap both appendJournalEntries and writeJournalHeader in withJournalLock:

  • Acquires .codegraph/changes.journal.lock with fs.openSync(path, 'wx') — atomic exclusive-create on ext4/NTFS/APFS.
  • Stamps the holder PID inside the lockfile.
  • Retries every 25ms up to a 5s timeout.
  • Steals the lock if the holder PID is dead (process.kill(pid, 0)) or the file's mtime is older than 30s — covers crash-mid-write.
  • Releases the fd + unlinks the lockfile in finally.

Zero new dependencies. Atomics.wait on a SharedArrayBuffer provides synchronous retry sleep so the existing sync API is preserved (watcher and build stages call these functions synchronously).
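Putting the bullets above together, the acquire loop might look like the sketch below. The constant names and exact control flow are assumptions for illustration, not the merged code — and the naive unlink-based steal and unguarded PID stamp shown here are precisely the parts the review threads later in this PR tighten up.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import * as os from "node:os";

// Assumed constants mirroring the numbers in the description above.
const LOCK_TIMEOUT_MS = 5_000;
const LOCK_RETRY_MS = 25;
const LOCK_STALE_MS = 30_000;

function sleepSync(ms: number): void {
  // Synchronous sleep without async/await; Node permits Atomics.wait
  // on the main thread (browsers do not).
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

function isPidAlive(pid: number): boolean {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0); // signal 0: existence probe, delivers nothing
    return true;
  } catch (e) {
    // EPERM: the process exists but is owned by another user.
    return (e as NodeJS.ErrnoException).code === "EPERM";
  }
}

function withJournalLock<T>(lockPath: string, fn: () => T): T {
  const deadline = Date.now() + LOCK_TIMEOUT_MS;
  let fd: number | undefined;
  while (fd === undefined) {
    if (Date.now() > deadline) throw new Error(`timed out acquiring ${lockPath}`);
    try {
      fd = fs.openSync(lockPath, "wx"); // atomic exclusive-create
      fs.writeSync(fd, `${process.pid}\n`); // stamp holder PID
      break;
    } catch (e) {
      if ((e as NodeJS.ErrnoException).code !== "EEXIST") throw e;
    }
    try {
      const holderPid = Number(fs.readFileSync(lockPath, "utf-8").trim());
      const ageMs = Date.now() - fs.statSync(lockPath).mtimeMs;
      if (!isPidAlive(holderPid) || ageMs > LOCK_STALE_MS) {
        fs.unlinkSync(lockPath); // naive steal: racy with multiple stealers
        continue;
      }
    } catch {
      continue; // lockfile vanished between open and read: retry at once
    }
    sleepSync(LOCK_RETRY_MS);
  }
  try {
    return fn();
  } finally {
    fs.closeSync(fd!);
    try { fs.unlinkSync(lockPath); } catch { /* already removed */ }
  }
}
```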

Test plan

  • npx vitest run tests/unit/journal.test.ts — 20/20 pass (17 existing + 3 new: lockfile cleanup, dead-PID stale-lock stealing, no-interleave across 200 mixed appends)
  • npx vitest run tests/builder/detect-changes.test.ts tests/integration/build.test.ts tests/integration/watcher-rebuild.test.ts — 27/27 pass
  • npm run typecheck — clean

appendJournalEntries used fs.appendFileSync with no cross-process
coordination, so a watcher session and a manual codegraph build in a
second shell could interleave lines and corrupt .codegraph/changes.journal
(truncated DELETED prefixes, partial entries, newline-less tails).

Wrap both appendJournalEntries and writeJournalHeader in a
withJournalLock helper that:
- acquires .codegraph/changes.journal.lock via fs.openSync(path, 'wx')
  (atomic exclusive-create, cross-platform)
- stamps the holder PID for stale-lock detection
- retries every 25ms up to a 5s timeout
- steals the lock if the holder PID is dead or the lock file is older
  than 30s (covers crash-mid-write)
- always releases in finally

Zero new dependencies — uses only node:fs and Atomics.wait on a
SharedArrayBuffer for synchronous retry sleep, preserving the existing
sync API.

Fixes #996

Impact: 7 functions changed, 12 affected
@claude

claude Bot commented Apr 21, 2026

Claude finished @carlos-alm's task


PR Review: fix(journal): serialize concurrent appends via lockfile

Todo List

  • Read and analyze the modified journal implementation
  • Review test changes for integrity and coverage
  • Check for configuration changes or rule relaxations
  • Perform root cause analysis
  • Assess backlog compliance criteria
  • Provide final review assessment

Root Cause Analysis

Problem Being Solved: Race condition where concurrent journal writers (watch session + manual codegraph build) can interleave lines in .codegraph/changes.journal, causing corruption like truncated "DELETED " prefixes or paths without newlines. This leads to readJournal returning valid: false and falling back to expensive hash scans.

Why This Approach: Uses atomic file locking via fs.openSync(path, 'wx'), an OS-level exclusive-create operation. The approach is sound: it wraps both appendJournalEntries and writeJournalHeader in withJournalLock to serialize all journal writes.

Risk Assessment: LOW - Well-implemented defensive approach with proper timeout, dead PID detection, and stale lock cleanup.

Backlog Compliance

  • Zero-dep: ✓ — Uses only Node.js built-ins (fs, process.kill, SharedArrayBuffer, Atomics)
  • Foundation-aligned: ✓ — Solves core reliability issue without violating any principles
  • Problem-fit: 4 — Directly prevents AI agents from experiencing silent perf degradation due to corrupted journals
  • Breaking: No — Internal implementation change, no API changes
  • Tier: 1

Technical Review

Implementation Quality ⭐⭐⭐⭐☆

Strong Points:

  • Atomic locking: fs.openSync(path, 'wx') is the correct way to do exclusive file creation
  • Dead PID detection: process.kill(pid, 0) with proper EPERM handling (lines 17-25)
  • Stale lock cleanup: 30s timeout prevents indefinite blocking from crashed processes
  • Sync sleep: Clever use of SharedArrayBuffer + Atomics.wait for synchronous retry without async/await
  • Proper resource cleanup: finally block ensures fd close + lock unlink (lines 101-105)

Implementation Details:

  • Lock timeout: 5s (reasonable for file I/O)
  • Retry interval: 25ms (appropriate granularity)
  • Stale threshold: 30s (good balance)
  • Lock file contains PID for diagnostics (line 34)

Test Coverage ⭐⭐⭐⭐☆

New Tests Added:

  1. Lock cleanup verification (lines 200-207): Ensures .lock file is removed after successful operations
  2. Dead PID lock stealing (lines 209-223): Tests stale lock recovery using impossible PID (2147483646)
  3. Interleaving prevention (lines 225-243): 200 concurrent appends verify no corruption occurs

Test Quality:

  • Good edge case coverage (dead PID, lock cleanup, concurrent access)
  • Uses realistic scenario (200 mixed appends simulating high concurrency)
  • Proper assertion: regex validation that all lines match expected format
  • No test integrity issues: tests are additive and strengthen coverage

Potential Issues (Minor)

  1. SharedArrayBuffer compatibility: While widely supported, could theoretically fail in some environments. However, fallback would be a simple setTimeout with Promise, and this preserves the existing sync API.

  2. Lock file in same directory: Generally fine, though some filesystems might have issues with rapid create/delete cycles. The implementation handles races properly with try/catch blocks.

  3. No explicit file sync: fs.writeSync and fs.appendFileSync don't guarantee disk persistence before proceeding, but this shouldn't affect correctness for the lockfile use case.

Critical Concerns

None. This is a solid fix for a real concurrency issue.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: Well-implemented fix for a legitimate race condition. Uses appropriate OS primitives, has good test coverage, adds no dependencies, and solves a real reliability problem that affects AI agent performance. The implementation is defensive and handles edge cases properly.

The only reason this isn't 5 stars is that it's solving an existing issue rather than adding exceptional new value, but the execution is excellent.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR serializes concurrent journal writers using an advisory lockfile (changes.journal.lock) with atomic exclusive-create, nonce-verified steal-via-rename, busy-spin retry, and stale-tmp sweeping. The three previously raised P1s (TOCTOU steal race, Atomics.wait event-loop freeze, empty-lockfile nonce vulnerability) are all addressed in the current HEAD. Two narrower P2 edge cases remain in the steal and writeSync-failure paths.

Confidence Score: 4/5

Safe to merge for typical usage; two residual P2 races only manifest under resource exhaustion (EMFILE) or I/O errors combined with concurrent writers.

All three prior P1s are resolved. Remaining findings are P2: a self-lockout under EMFILE when openSync('r+') fails after a successful steal rename, and an unguarded unlinkSync in the writeSync-failure handler that could delete a concurrent stealer's live lock under ENOSPC. Both require unusual OS-level failure modes on top of concurrent writers, so they don't block merge but are worth a follow-up fix.

src/domain/graph/journal.ts — trySteal (lines 84–91) and acquireJournalLock writeSync-failure handler (lines 106–115)

Important Files Changed

Filename — Overview
src/domain/graph/journal.ts — Adds withJournalLock with atomic steal-via-rename, nonce verification, busy-spin retry, and stale-tmp sweeping; wraps all three write exports. Two P2 edge cases: self-lockout when openSync('r+') fails after winning a steal, and unguarded unlinkSync in the writeSync-failure handler that can delete a concurrent stealer's live lock.
tests/unit/journal.test.ts — Adds five lock-specific tests: cleanup, dead-PID steal, TOCTOU nonce-verification regression, interleave-free appends, and stale-tmp sweep. Coverage is solid for the happy paths and the previously reviewed races.

Sequence Diagram

sequenceDiagram
    participant W as Watcher
    participant B as Build
    participant FS as Filesystem

    W->>FS: openSync('wx') → fd [EEXIST if B holds lock]
    W->>FS: writeSync(fd, PID+nonce)
    Note over W: acquireJournalLock ✓

    B->>FS: openSync('wx') → EEXIST
    B->>FS: readFileSync(lock) → W's PID
    B->>FS: isPidAlive(W.PID) → true
    B->>FS: statSync(lock) → mtime fresh
    Note over B: shouldSteal=false → sleepSync(25ms) → retry loop

    W->>FS: appendFileSync(journal, lines)
    Note over W: fn() executes

    W->>FS: readFileSync(lock) → nonce match
    W->>FS: unlinkSync(lock)
    Note over W: releaseJournalLock ✓

    B->>FS: openSync('wx') → fd [lock now free]
    B->>FS: writeSync(fd, PID+nonce)
    Note over B: acquireJournalLock ✓
    B->>FS: appendFileSync(journal, lines)
    B->>FS: unlinkSync(lock)
    Note over B: releaseJournalLock ✓

Reviews (5): Last reviewed commit: "fix(journal): sweep orphaned lockfile .t..."

Comment thread src/domain/graph/journal.ts Outdated
Comment on lines +43 to +57
let holderAlive = true;
try {
  const pidContent = fs.readFileSync(lockPath, 'utf-8').trim();
  holderAlive = isPidAlive(Number(pidContent));
} catch {
  /* unreadable — fall through to age check */
}

if (!holderAlive) {
  try {
    fs.unlinkSync(lockPath);
  } catch {
    /* another writer stole it first */
  }
  continue;
Contributor

P1 TOCTOU race allows two processes to concurrently enter the critical section

When multiple processes are waiting on the same stale (dead-PID) lockfile, they can each independently steal the lock from one another's fresh acquisition:

  1. Process A dies — lockfile contains dead PID A.
  2. Processes B and C both call openSync('wx'), both get EEXIST, and both read dead PID A.
  3. B unlinks and immediately re-acquires (new inode I1, fn() starts).
  4. C now executes unlinkSync(lockPath) — it unlinks B's live lockfile (I1), not the stale one.
  5. C loops, calls openSync('wx'), succeeds (new inode I2), and its fn() begins concurrently with B's.

Both B and C are now inside the critical section simultaneously — exactly the write-serialisation invariant this PR is meant to guarantee.

The standard mitigation is to verify ownership after the unlink/re-create cycle: write a nonce (or use the inode number) inside the lockfile, then re-read it after openSync succeeds and bail if it doesn't match yours. Alternatively, use fs.renameSync(tmpfile, lockPath) to perform the steal atomically (create a temp file, rename onto the lockfile — POSIX rename is atomic).


Contributor Author

Fixed in f5c737c. Replaced the unlink + openSync('wx') steal pattern with an atomic write-tmp + fs.renameSync steal. Each writer generates a random nonce, writes it into a temp file, and atomically renames onto the lockfile. After rename we re-read the lockfile and only proceed if our nonce is still there — if another stealer's rename landed after ours, we bail and retry instead of unlinking their live lock. Release also nonce-verifies before unlinking. Added a regression test that stages a lockfile with a different-writer nonce after a real steal cycle and asserts we do not retroactively unlink it.
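The steal cycle described in that fix might look like the following sketch. trySteal is the PR's function name, but this body is a hypothetical illustration of the write-tmp + rename + nonce-verify pattern, not the merged code.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import * as os from "node:os";
import * as crypto from "node:crypto";

// Atomic steal of a stale lockfile: write our PID + a random nonce into a
// temp file, rename it onto the lockfile (atomic on POSIX), then re-read to
// confirm our write is the one that landed. Returns the nonce on success,
// or null if another stealer's rename overwrote ours.
function trySteal(lockPath: string): string | null {
  const nonce = crypto.randomBytes(8).toString("hex");
  const tmp = `${lockPath}.${nonce}.tmp`;
  fs.writeFileSync(tmp, `${process.pid}\n${nonce}\n`);
  fs.renameSync(tmp, lockPath); // atomically replaces the stale lock
  // Ownership check: if a competing stealer renamed after us, our nonce
  // is gone. Bail and retry rather than unlinking their live lock.
  const content = fs.readFileSync(lockPath, "utf-8");
  return content.includes(nonce) ? nonce : null;
}
```

Release then performs the same nonce comparison before unlinking, so a writer never removes a lockfile it no longer owns.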

Comment on lines +12 to +15
function sleepSync(ms: number): void {
  const buf = new Int32Array(new SharedArrayBuffer(4));
  Atomics.wait(buf, 0, 0, ms);
}
Contributor

P2 Atomics.wait freezes the Node.js event loop during lock contention

Atomics.wait is a synchronous, blocking call — it stops the entire V8 event loop for the full ms duration. In a watcher process, every filesystem notification, timer, and pending I/O callback is silenced for each 25 ms retry. In the worst case (5 000 ms timeout, 200 retries), the watcher becomes completely unresponsive for up to 5 seconds before ever throwing.

A lighter alternative that avoids blocking the event loop is a simple busy-spin with process.hrtime.bigint():

function sleepSync(ms: number): void {
  const end = process.hrtime.bigint() + BigInt(ms) * 1_000_000n;
  while (process.hrtime.bigint() < end) { /* spin */ }
}

This keeps each retry short and doesn't starve unrelated callbacks (though it does keep the CPU busy, which is acceptable for the brief per-retry duration).


Contributor Author

Fixed in f5c737c. Replaced Atomics.wait with a short process.hrtime.bigint busy-spin per your suggestion. The 25ms retry interval keeps CPU burn negligible while letting pending FS events, timers, and I/O callbacks in watcher processes keep firing during contention.

@github-actions
Contributor

github-actions Bot commented Apr 21, 2026

Codegraph Impact Analysis

12 functions changed, 14 callers affected across 5 files

  • sleepSync in src/domain/graph/journal.ts:17 (5 transitive callers)
  • isPidAlive in src/domain/graph/journal.ts:24 (5 transitive callers)
  • AcquiredLock.fd in src/domain/graph/journal.ts:36 (5 transitive callers)
  • AcquiredLock.nonce in src/domain/graph/journal.ts:37 (5 transitive callers)
  • trySteal in src/domain/graph/journal.ts:51 (5 transitive callers)
  • acquireJournalLock in src/domain/graph/journal.ts:94 (8 transitive callers)
  • releaseJournalLock in src/domain/graph/journal.ts:162 (8 transitive callers)
  • sweepStaleTmpFiles in src/domain/graph/journal.ts:180 (8 transitive callers)
  • withJournalLock in src/domain/graph/journal.ts:210 (9 transitive callers)
  • appendJournalEntries in src/domain/graph/journal.ts:277 (0 transitive callers)
  • writeJournalHeader in src/domain/graph/journal.ts:297 (4 transitive callers)
  • appendJournalEntriesAndStampHeader in src/domain/graph/journal.ts:327 (4 transitive callers)

Address two Greptile review issues on the journal lockfile:

- P1 TOCTOU: when two stealers observed the same stale (dead-PID)
  holder, one's unlink could cross the other's fresh openSync('wx')
  acquisition, admitting both writers into the critical section. Replace
  the unlink + openSync('wx') pattern with an atomic write-tmp + rename
  steal, then verify via a random nonce. If another stealer's rename
  landed after ours, we bail and retry instead of unlinking their live
  lockfile. Release now also nonce-verifies before unlinking.

- P2 event-loop blockage: replace Atomics.wait with a short hrtime
  busy-spin so pending FS events and timer callbacks in the watcher
  keep firing during the 25ms retry window.

Add a regression test for the stale-lock steal race that asserts we
never unlink a lockfile whose nonce does not match our own.

Impact: 7 functions changed, 10 affected
@carlos-alm
Contributor Author

@greptileai

Comment on lines +98 to +108
try {
  const fd = fs.openSync(lockPath, 'wx');
  try {
    fs.writeSync(fd, `${process.pid}\n${nonce}\n`);
  } catch {
    /* PID stamp is advisory; fd is still exclusive */
  }
  return { fd, nonce };
} catch (e) {
  if ((e as NodeJS.ErrnoException).code !== 'EEXIST') throw e;
}
Contributor

P1 Silent writeSync failure voids the nonce and breaks mutual exclusion

If fs.writeSync throws (e.g. ENOSPC, I/O error), the lockfile is created by openSync('wx') but remains empty — the nonce is never written. Two consequences compound:

  1. releaseJournalLock reads the empty file, content.includes(lock.nonce) is false, and the lockfile is orphaned rather than unlinked.
  2. Any concurrent waiter reads the empty file, computes Number('') → 0, then calls isPidAlive(0) which returns false (the pid <= 0 guard). With holderAlive = false, it immediately calls trySteal — stealing a live, active lock and breaking mutual exclusion.

The comment "PID stamp is advisory; fd is still exclusive" only holds when there are no concurrent waiters, but a journal with a watcher + build process is exactly the concurrent scenario this lock is meant to protect.

Fix: when the write fails, release the fd and unlink synchronously (we still hold the exclusive fd at that point) and retry, rather than proceeding into fn() with an un-stamped lockfile.


Contributor Author

Fixed in e17da0a. When the nonce stamp writeSync throws, acquireJournalLock now closes the fd, unlinks the empty lockfile, and retries instead of returning with an un-stamped lockfile. This prevents concurrent waiters from reading an empty nonce, treating isPidAlive(0) as dead, and stealing the live lock.
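The corrected acquisition step might look like this sketch. tryAcquire and the single-attempt shape are assumptions for illustration; the point is the failure path: if the nonce stamp cannot be written, surrender the lock entirely rather than holding it un-stamped.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import * as os from "node:os";

// One acquisition attempt. Returns the open fd on success, or null when the
// lock is contended or the nonce stamp failed (caller retries either way).
function tryAcquire(lockPath: string, nonce: string): number | null {
  let fd: number;
  try {
    fd = fs.openSync(lockPath, "wx"); // exclusive create
  } catch (e) {
    if ((e as NodeJS.ErrnoException).code === "EEXIST") return null; // held
    throw e;
  }
  try {
    fs.writeSync(fd, `${process.pid}\n${nonce}\n`);
  } catch {
    // ENOSPC / I/O error: an empty lockfile reads as PID 0 (dead) to
    // waiters and invites a steal of our live lock. Close and unlink
    // while we still hold the exclusive fd, then report failure.
    fs.closeSync(fd);
    try { fs.unlinkSync(lockPath); } catch { /* already swept */ }
    return null;
  }
  return fd;
}
```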

Integrates main's appendJournalEntriesAndStampHeader and mkdirSync cleanup
with this PR's lockfile serialization. The new function is also wrapped in
withJournalLock so it serializes with appendJournalEntries and writeJournalHeader
against the same journal file. Also addresses Greptile P1: silent writeSync
failure in acquireJournalLock now releases the fd + unlinks the empty lockfile
and retries, preventing empty-lockfile mutual-exclusion breakage.

Impact: 10 functions changed, 16 affected
…ntry (#1002)

Addresses Greptile P2: crash-mid-steal in trySteal leaves
.codegraph/changes.journal.lock.<nonce>.tmp files behind. Without cleanup
they accumulate silently across crash cycles.

Adds sweepStaleTmpFiles called at the top of withJournalLock which removes
any changes.journal.lock.*.tmp older than LOCK_STALE_MS. The age filter
avoids racing an in-flight steal on another process.
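The sweep described above might be sketched as follows, with LOCK_STALE_MS standing in for the PR's actual constant. The age filter is what keeps the sweep from deleting a temp file another process is about to rename.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import * as os from "node:os";

const LOCK_STALE_MS = 30_000; // assumed to match the lock's stale threshold

// Remove orphaned steal temp files (<lock>.<nonce>.tmp), but only ones old
// enough that no live steal on another process can still be mid-rename.
function sweepStaleTmpFiles(lockPath: string): void {
  const dir = path.dirname(lockPath);
  const prefix = `${path.basename(lockPath)}.`;
  let names: string[];
  try {
    names = fs.readdirSync(dir);
  } catch {
    return; // journal dir missing: nothing to sweep
  }
  const now = Date.now();
  for (const name of names) {
    if (!name.startsWith(prefix) || !name.endsWith(".tmp")) continue;
    const full = path.join(dir, name);
    try {
      if (now - fs.statSync(full).mtimeMs > LOCK_STALE_MS) fs.unlinkSync(full);
    } catch {
      /* raced another sweeper or an in-flight steal: skip */
    }
  }
}
```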

Impact: 2 functions changed, 10 affected
@carlos-alm
Contributor Author

Addressed Greptile's review feedback from the 5/5 review:

  • P1 silent writeSync failure (comment 3122282730): Fixed in e17da0a. When the nonce stamp fails, acquireJournalLock now closes the fd, unlinks the empty lockfile, and retries instead of proceeding with an un-stamped lockfile that concurrent waiters would treat as stale.
  • P2 orphaned .tmp files on crash mid-steal: Fixed in 7534d65. Added sweepStaleTmpFiles at the top of withJournalLock that removes changes.journal.lock.*.tmp files older than LOCK_STALE_MS. The age filter prevents racing an in-flight steal on another process. Regression test added.
  • P2 test is single-process / does not exercise inter-process concurrency: Kept the single-process format-integrity test as-is. A true multi-process test would require spawning child processes with cross-platform handling (especially on Windows CI, where child_process.spawnSync semantics differ). That is significant complexity for marginal coverage gain over the P1 regression test ("does not unlink another writer's lockfile after a stale-lock steal race"), which already validates the core invariant the lock is meant to protect. Happy to revisit if the watcher+build scenario ever regresses.

@carlos-alm
Contributor Author

@greptileai


Development

Successfully merging this pull request may close these issues.

bug(journal): concurrent appends are not atomic — corrupted under parallel builds

1 participant